 postgresql database


MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

arXiv.org Artificial Intelligence

Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
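The abstract does not spell out how ALO-ASR is computed. A minimal sketch, assuming (from the metric's name) that a prompt counts as broken when at least one of the evaluated attacks succeeds on it, might look like this. The function name `alo_asr`, the attack labels, and the toy results are all hypothetical and are not taken from the MixAT code.

```python
from typing import Dict, List


def alo_asr(results: Dict[str, List[bool]]) -> float:
    """Fraction of prompts on which at least one attack succeeded.

    `results` maps an attack name to per-prompt success flags; every list
    must follow the same prompt order. This reading of the metric is an
    assumption based on its name, not the paper's reference implementation.
    """
    per_attack = list(results.values())
    if not per_attack or not per_attack[0]:
        raise ValueError("need at least one attack and one prompt")
    n_prompts = len(per_attack[0])
    if any(len(flags) != n_prompts for flags in per_attack):
        raise ValueError("all attacks must cover the same prompts")
    # zip(*per_attack) yields, for each prompt, the outcome of every attack.
    broken = sum(1 for flags in zip(*per_attack) if any(flags))
    return broken / n_prompts


# Hypothetical toy outcomes: three attacks evaluated on four prompts.
toy_results = {
    "attack_a": [True, False, False, False],
    "attack_b": [False, False, True, False],
    "attack_c": [True, False, False, False],
}
per_attack_asr = {name: sum(f) / len(f) for name, f in toy_results.items()}
score = alo_asr(toy_results)
```

In this toy data each attack alone succeeds on only a quarter of the prompts, yet two of the four prompts are broken by some attack, which is why a worst-case metric of this kind can be much higher than any single attack's success rate.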


End-to-end Machine Learning Pipeline with Docker and Apache Airflow from scratch

#artificialintelligence

This post describes the implementation of a sample Machine Learning pipeline on Apache Airflow with Docker, covering all the steps required to set up a working local environment from scratch. Imagine we have a Jupyter Notebook with a polished Machine Learning experiment, including all the stages that lead from raw data to a fairly performant model. In our scenario, new input data arrives in daily batches, and training should run as soon as a new batch is provisioned, so that the model's parameters are tuned to accommodate changes in the data. Moreover, the experiment's parameters, training conditions, and performance should be tracked in order to monitor the results of the different training sessions. Finally, the resulting models should be saved and made available to other systems for inference, while keeping each generated model under version control.


Online Learning with LakeFS and AWS

#artificialintelligence

Most tutorials and articles focus on paper reviews and the performance of machine learning models in a lab. A significantly overlooked area, however, is putting models into production and monitoring their performance, which is where online machine learning, or online learning, comes in: the model constantly learns from new data. The main advantage of online learning is that it prevents the model's view of the data from going "stale". The nature and distribution of the data often change over time. If your model doesn't keep on improving, its performance will keep on decreasing.
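To make the "stale" problem concrete, here is a minimal sketch of online learning in plain Python. It is not the article's LakeFS/AWS setup: the one-weight linear model, the learning rate, and the synthetic stream whose slope drifts from 2 to 3 are all assumptions chosen for illustration.

```python
from typing import Iterator, Tuple


class OnlineLinearModel:
    """One-weight linear model (y ~ weight * x) updated one sample at a time."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.weight = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.weight * x

    def learn_one(self, x: float, y: float) -> None:
        # Single stochastic-gradient step on the squared error.
        error = y - self.predict(x)
        self.weight += self.learning_rate * error * x


def stream(slope: float, cycles: int) -> Iterator[Tuple[float, float]]:
    """Synthetic (x, y) samples following y = slope * x."""
    for _ in range(cycles):
        for x in (0.5, 1.0, 1.5, 2.0):
            yield x, slope * x


model = OnlineLinearModel()
for x, y in stream(slope=2.0, cycles=10):   # original data distribution
    model.learn_one(x, y)
frozen_weight = model.weight                # snapshot of a model that stops learning

for x, y in stream(slope=3.0, cycles=10):   # the distribution drifts
    model.learn_one(x, y)

# Compare both models on a sample drawn from the drifted distribution.
x_new, y_new = 2.0, 3.0 * 2.0
frozen_error = abs(y_new - frozen_weight * x_new)
online_error = abs(y_new - model.predict(x_new))
```

The frozen snapshot keeps predicting with the old slope and its error on the drifted data stays large, while the model that continues to call `learn_one` tracks the new slope.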


PostgreSQL and Machine Learning

#artificialintelligence

I will show you how to apply Machine Learning algorithms to data from a PostgreSQL database to get insights and predictions. I will use supervised, an open-source Python package for Automated Machine Learning (AutoML). Thanks to AutoML, I will get quick access to many ML algorithms: Decision Tree, Logistic Regression, Random Forest, XGBoost, and Neural Network. AutoML will handle feature engineering as well.


End to End Machine Learning: From Data Collection to Deployment

#artificialintelligence

This started out as a challenge. A friend of mine and I wanted to see if it was possible to build something from scratch and push it to production. In this post, we'll go through the necessary steps to build and deploy a machine learning application. The journey goes from data collection to deployment and, as you'll see, it is exciting and fun. Before we begin, let's have a look at the app we'll be building. This web app allows a user to evaluate random brands by writing reviews. While writing, the user will see the sentiment score of their input updating in real time, along with a proposed rating from 1 to 5. The user can then change the rating in case the suggested one does not reflect their views, and submit. You can think of this as a crowdsourcing app for brand reviews, with a sentiment analysis model that suggests ratings that the user can tweak and adapt afterwards. To build this application, we'll follow a series of steps. All the code is available in our GitHub repository and organized in independent directories, so you can check it, run it, and improve it. Disclaimer: the scripts below are meant for educational purposes only, so scrape responsibly. In order to train a sentiment classifier, we need data. We could certainly download open-source datasets for sentiment analysis tasks, such as Amazon Polarity or IMDB movie reviews, but for the purpose of this tutorial we'll build our own dataset.


r/MachineLearning - [Project] pgANN Fast Approximate Nearest Neighbor (ANN) searches with a PostgreSQL database.

#artificialintelligence

Hi, we did experiment with ES, using range queries on the vectors and boolean querying them, and also tried using LSH/MinHash to save a signature for each vector. Did you have a different approach in mind? Also, you're correct about L1 and L2 distances being poor metrics in this dimensionality, but our goal was to fetch a subset of (say) a few thousand "good enough" results, from a pool of tens of millions, that can then be re-ranked with cosine or a similar metric. Unfortunately, there are no easy wins in ANN, and this works well enough for us. We hope others can benefit as well.
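The re-ranking step mentioned above can be sketched roughly as follows. This is not pgANN's code: the function names, the `(id, vector)` candidate format, and the toy vectors are assumptions, and the candidates are taken to be the "good enough" subset that a coarser database query has already returned.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(
    query: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score pre-fetched candidates against the query and keep the best top_k."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical candidates, standing in for rows returned by the coarse query.
query = [1.0, 0.0]
candidates = [("a", [0.0, 1.0]), ("b", [2.0, 0.0]), ("c", [1.0, 1.0])]
top = rerank(query, candidates, top_k=2)
```

Because the exact metric only runs over a few thousand candidates rather than tens of millions of rows, the first-stage filter can afford to be crude as long as it rarely drops true neighbors.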


Data Lake Machine Learning Models with Python and Dremio

#artificialintelligence

Amazon Simple Storage Service (S3) is an object storage service that offers high availability and reliability, easy scaling, security, and performance. Many companies all around the world use Amazon S3 to store and protect their data. PostgreSQL is an open-source object-relational database system. In addition to its many useful features, PostgreSQL is highly extensible, which makes it easier to handle even the most complicated data workloads. In this article, we will show how to load data into Amazon S3 and PostgreSQL, how to connect these sources to Dremio, and how to perform data curation.


Report: AWS to unveil new machine learning tools, database at re:Invent

#artificialintelligence

Amazon Web Services is expected to announce a new version of its PostgreSQL database for its cloud customers during its AWS re:Invent conference next week, according to a Fortune report. AWS hopes its PostgreSQL database will appeal to larger companies, some of which have complained about the price of Oracle and Microsoft offerings. Amazon also intends to announce new machine learning tools for AWS developers, according to the report. AWS' annual conference is sure to reveal a bevy of new capabilities. Cloud providers often try to one-up other service providers with new, flashy capabilities, and re:Invent will likely be no different.